Search results for " High-dimensional data"
showing 10 items of 24 documents
Penalized regression and clustering in high-dimensional data
The main goal of this Thesis is to describe numerous statistical techniques that deal with high-dimensional genomic data. The Thesis begins with a review of the literature on penalized regression models, with particular attention to least absolute shrinkage and selection operator (LASSO) or L1-penalty methods. L1 logistic/multinomial regression models are used for variable selection and discriminant analysis with a binary/categorical response variable. The Thesis discusses and compares several methods that are commonly utilized in genetics, and introduces new strategies to select markers according to their informative content and to discriminate clusters by offering reduced panels for popul…
An Extension of the DgLARS Method to High-Dimensional Relative Risk Regression Models
2020
In recent years, clinical studies, where patients are routinely screened for many genomic features, are becoming more common. The general aim of such studies is to find genomic signatures useful for treatment decisions and the development of new treatments. However, genomic data are typically noisy and high dimensional, not rarely outstripping the number of patients included in the study. For this reason, sparse estimators are usually used in the study of high-dimensional survival data. In this paper, we propose an extension of the differential geometric least angle regression method to high-dimensional relative risk regression models.
Regularized Regression Incorporating Network Information: Simultaneous Estimation of Covariate Coefficients and Connection Signs
2014
We develop an algorithm that incorporates network information into regression settings. It simultaneously estimates the covariate coefficients and the signs of the network connections (i.e. whether the connections are of an activating or of a repressing type). For the coefficient estimation steps an additional penalty is set on top of the lasso penalty, similarly to Li and Li (2008). We develop a fast implementation for the new method based on coordinate descent. Furthermore, we show how the new methods can be applied to time-to-event data. The new method yields good results in simulation studies concerning sensitivity and specificity of non-zero covariate coefficients, estimation of networ…
A local complexity based combination method for decision forests trained with high-dimensional data
2012
Accurate machine learning with high-dimensional data is affected by phenomena known as the “curse” of dimensionality. One of the main strategies explored in the last decade to deal with this problem is the use of multi-classifier systems. Several of such approaches are inspired by the Random Subspace Method for the construction of decision forests. Furthermore, other studies rely on estimations of the individual classifiers' competence, to enhance the combination in the multi-classifier and improve the accuracy. We propose a competence estimate which is based on local complexity measurements, to perform a weighted average combination of the decision forest. Experimental results show how thi…
Distance Functions, Clustering Algorithms and Microarray Data Analysis
2010
Distance functions are a fundamental ingredient of classification and clustering procedures, and this holds true also in the particular case of microarray data. In the general data mining and classification literature, functions such as Euclidean distance or Pearson correlation have gained their status of de facto standards thanks to a considerable amount of experimental validation. For microarray data, the issue of which distance function works best has been investigated, but no final conclusion has been reached. The aim of this extended abstract is to shed further light on that issue. Indeed, we present an experimental study, involving several distances, assessing (a) their intrinsic sepa…
SparseHC: A Memory-efficient Online Hierarchical Clustering Algorithm
2014
Computing a hierarchical clustering of objects from a pairwise distance matrix is an important algorithmic kernel in computational science. Since the storage of this matrix requires quadratic space with respect to the number of objects, the design of memory-efficient approaches is of high importance to this research area. In this paper, we address this problem by presenting a memory-efficient online hierarchical clustering algorithm called SparseHC. SparseHC scans a sorted and possibly sparse distance matrix chunk-by-chunk. Meanwhile, a dendrogram is built by merging cluster pairs as and when the distance between them is determined to be the smallest among all remaining cluster pairs. The k…
Sparse relative risk regression models
2020
Summary Clinical studies where patients are routinely screened for many genomic features are becoming more routine. In principle, this holds the promise of being able to find genomic signatures for a particular disease. In particular, cancer survival is thought to be closely linked to the genomic constitution of the tumor. Discovering such signatures will be useful in the diagnosis of the patient, may be used for treatment decisions and, perhaps, even the development of new treatments. However, genomic data are typically noisy and high-dimensional, not rarely outstripping the number of patients included in the study. Regularized survival models have been proposed to deal with such scenarios…
Structural clustering of millions of molecular graphs
2014
We propose an algorithm for clustering very large molecular graph databases according to scaffolds (i.e., large structural overlaps) that are common between cluster members. Our approach first partitions the original dataset into several smaller datasets using a greedy clustering approach named APreClus based on dynamic seed clustering. APreClus is an online and instance incremental clustering algorithm delaying the final cluster assignment of an instance until one of the so-called pending clusters the instance belongs to has reached significant size and is converted to a fixed cluster. Once a cluster is fixed, APreClus recalculates the cluster centers, which are used as representatives for…
A fast and recursive algorithm for clustering large datasets with k-medians
2012
Clustering with fast algorithms large samples of high dimensional data is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967) who introduced a sequential version of the $k$-means algorithm, a new class of recursive stochastic gradient algorithms designed for the $k$-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to deal with large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. A particular attention is paid to the averaged versions, which…
A novel heuristic memetic clustering algorithm
2013
In this paper we introduce a novel clustering algorithm based on the Memetic Algorithm meta-heuristic wherein clusters are iteratively evolved using a novel single operator employing a combination of heuristics. Several heuristics are described and employed for the three types of selections used in the operator. The algorithm was exhaustively tested on three benchmark problems and compared to a classical clustering algorithm (k-Medoids) using the same performance metrics. The results show that our clustering algorithm consistently provides better clustering solutions with less computational effort.